Deep Neural Networks: Deep Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

Machine Learning vs. Deep Learning


  • State-of-the-art until 2012


  • Deep supervised learning



Deep Artificial Neural Networks

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons
  • Hidden layers
    • Autonomous feature learning




1. Training NN: Backpropagation

$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data

1.1. Optimization

3 key components

  1. objective function $f(\cdot)$
  2. decision variable or unknown $\omega$
  3. constraints $g(\cdot)$

In mathematical form:



$$\begin{align*} \min_{\omega} \quad &f(\omega) \end{align*} $$

1.2. Loss Function

  • Measures error between target values and predictions


$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

  • Example
    • Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
    • Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
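As a quick numerical check, both losses can be evaluated directly; a minimal NumPy sketch (the prediction and target values are made up for illustration):

```python
import numpy as np

def squared_loss(h, y):
    # mean squared error over m examples
    return np.mean((h - y) ** 2)

def cross_entropy(h, y):
    # binary cross entropy; h holds predicted probabilities in (0, 1)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = np.array([0.9, 0.2, 0.8])   # predictions h_w(x^(i))
y = np.array([1.0, 0.0, 1.0])   # targets y^(i)

print(squared_loss(h, y))
print(cross_entropy(h, y))
```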

1.3. Learning

Learning weights and biases from data using gradient descent


$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
  • $\frac{\partial \ell}{\partial \omega}$: computing this naively for every $\omega$ requires too many computations
  • Structural constraints of NN:
    • Composition of functions
    • Chain rule
    • Dynamic programming

Backpropagation

  • Forward propagation
    • the input information propagates forward through the hidden units at each layer and finally produces the output
  • Backpropagation

    • allows the information from the cost to flow backwards through the network in order to compute the gradients
  • Chain Rule

    • Computing the derivative of the composition of functions

      • $f(g(x))' = f'(g(x))\,g'(x)$

      • $\dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$

      • $\dfrac{dz}{dw} = \left(\dfrac{dz}{dy} \cdot \dfrac{dy}{dx}\right) \cdot \dfrac{dx}{dw}$

      • $\dfrac{dz}{du} = \left(\dfrac{dz}{dy} \cdot \dfrac{dy}{dx} \cdot \dfrac{dx}{dw}\right) \cdot \dfrac{dw}{du}$

  • Backpropagation

    • Update weights recursively with memory
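The chain rule computation above can be checked on a tiny composition. A minimal sketch (the particular functions $z = y^2$, $y = \sin x$, $x = 3w$, $w = u + 1$ are chosen only for illustration): the forward pass stores the intermediate values, and the backward pass multiplies the local derivatives, which is exactly what backpropagation does at scale.

```python
import math

def forward(u):
    # forward pass: z = y^2, y = sin(x), x = 3*w, w = u + 1
    w = u + 1.0
    x = 3.0 * w
    y = math.sin(x)
    z = y * y
    return z, (w, x, y)

def backward(u):
    # backward pass: multiply local derivatives, reusing stored intermediates
    _, (w, x, y) = forward(u)
    dz_dy = 2.0 * y
    dy_dx = math.cos(x)
    dx_dw = 3.0
    dw_du = 1.0
    return dz_dy * dy_dx * dx_dw * dw_du

u = 0.4
analytic = backward(u)
h = 1e-6
numeric = (forward(u + h)[0] - forward(u - h)[0]) / (2 * h)  # finite-difference check
print(analytic, numeric)
```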

Optimization procedure


  • It is not easy to numerically compute gradients in a network in general.
    • The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
    • There is a wide range of tools, such as TensorFlow

2. Vanishing Gradient

  • The Vanishing Gradient Problem

  • As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.

  • For example,
$$\frac{dz}{du} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{d\omega} \cdot \frac{d\omega}{du} $$
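A quick numerical illustration (a sketch, not from the source): the sigmoid's derivative is at most 0.25, so the product of per-layer derivatives in a deep chain shrinks geometrically with depth, even in the best case.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

# best-case product of per-layer sigmoid derivatives for chains of depth n
products = [sigmoid_grad(0.0) ** n for n in (1, 5, 10)]
print(products)  # 0.25, ~9.8e-4, ~9.5e-7: the gradient vanishes with depth
```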




  • Rectifiers
  • The use of the ReLU activation function was a great improvement compared to the historical tanh.




  • This can be explained by the derivative of ReLU itself not vanishing, and by the resulting representations being sparse (Glorot et al., 2011).




3. Gradient Descent in Deep Learning

  • Gradient Descent
$$\text{Repeat : } x \leftarrow x - \alpha \nabla _x f(x) \quad \quad \text{for some step size } \alpha > 0$$



In this lecture, we will cover gradient descent algorithm and its variants:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

We will explore the concept of these three gradient descent algorithms with a logistic regression model.
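As a warm-up, the basic update rule can be run on a one-dimensional convex function; a minimal sketch (the function $f(x) = (x-3)^2$ is an illustrative choice, not from the source):

```python
# Plain gradient descent on f(x) = (x - 3)^2, whose gradient is 2(x - 3).
alpha = 0.1   # step size
x = 0.0       # initial guess
for _ in range(100):
    grad = 2.0 * (x - 3.0)
    x = x - alpha * grad
print(x)  # converges to the minimizer x = 3
```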

3.1. Batch Gradient Descent

(= Gradient Descent)

In the Batch Gradient Descent method, we use the entire training data set for each iteration. So, for each update, we have to sum over all examples.


$$\mathcal{E} (\omega) = \frac{1}{m} \sum_{i=1}^{m} \ell (\hat y_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)$$

By linearity,


$$\nabla_{\omega} \mathcal{E} = \nabla_{\omega} \frac{1}{m} \sum_{i=1}^{m} \ell (h_{\omega}(x_i), y_i)= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial }{\partial \omega}\ell (h_{\omega}(x_i), y_i)$$


$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

The main advantages:

  • We can use a fixed learning rate during training without worrying about learning rate decay.
  • The trajectory toward the minimum is straight.
  • In theory, it is guaranteed to converge to the global minimum if the loss function is convex, and to a local minimum if it is not.
  • It gives an unbiased estimate of the gradient, and as the number of examples increases, the standard error decreases.

Although it is a safe and accurate method, it is very inefficient in terms of computation.

The main disadvantages:

  • With a large data set, this method may be slow to converge.
  • Each learning step happens only after going over all examples, some of which may be redundant and contribute little to the update.
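A minimal batch gradient descent sketch for logistic regression, matching this lecture's setting (the toy data and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy binary classification data (made up for illustration)
m = 200
X = rng.normal(size=(m, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
alpha = 0.5
for _ in range(500):
    h = sigmoid(X @ w)
    grad = X.T @ (h - y) / m   # gradient averaged over ALL m examples
    w = w - alpha * grad       # one update per full pass

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
print(w, acc)
```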

3.2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an extreme case of Mini-batch Gradient Descent in which the batch size is one: learning happens on every single example. It is less common in practice than the mini-batch gradient descent method.

Update the parameters based on the gradient for a single training example:


$$f(\omega) = \ell (\hat y_i, y_i) = \ell (h_{\omega}(x_i), y_i) = \ell^{(i)}$$


$$\omega \leftarrow \omega - \alpha \, \frac{\partial \ell^{(i)}}{\partial \omega}$$

The advantages:

  • It adds even more noise to the learning process than mini-batch gradient descent, which helps improve the generalization error.

The disadvantages:

  • It uses only one example for each update, and a single example can hardly represent the whole data set. In other words, the variance of the gradient estimate becomes large.
  • Due to the noise, the learning steps oscillate more.
  • It becomes very slow, since vectorization cannot be exploited over a single example.

Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient:


$$\mathbb{E} \left[\frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell^{(i)}}{\partial \omega} = \frac{\partial }{\partial \omega} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell^{(i)} \right] = \frac{\partial \mathcal{E}}{\partial \omega}$$
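The same kind of logistic regression problem trained with SGD, one randomly sampled example per update (a sketch with made-up data and hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy binary classification data, trained one example at a time
m = 200
X = rng.normal(size=(m, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    i = rng.integers(m)            # sample a single training example
    h = sigmoid(X[i] @ w)
    grad = (h - y[i]) * X[i]       # noisy but unbiased gradient estimate
    w = w - alpha * grad

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
print(acc)
```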

Below is a graph that shows the gradient descent's variants and their direction towards the minimum:



As we can see in the figure, the SGD direction is very noisy compared to the others.

3.3. Mini-batch Gradient Descent

The Mini-batch Gradient Descent method is widely used in machine learning and deep learning training. The main idea is similar to batch gradient descent, but the batch size $s$ can be customized: instead of going over all $m$ examples, mini-batch gradient descent sums over a smaller number of examples, $s \;(< m)$, so parameters are updated based on a mini-batch at each iteration. This works because the examples in a data set are assumed to be positively correlated; in other words, a large data set contains many similar examples.


$$\mathcal{E} (\omega) = \frac{1}{s} \sum_{i=1}^{s} \ell (\hat y_i, y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell (h_{\omega}(x_i), y_i) = \frac{1}{s} \sum_{i=1}^{s} \ell^{(i)}$$


$$\omega \leftarrow \omega - \alpha \, \nabla_{\omega} \mathcal{E} $$

Stochastic gradients computed on larger mini-batches have smaller variance:


$$\text{var} \left[ \frac{1}{s} \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s^2} \text{var} \left[ \sum_{i=1}^{s} \frac{\partial \ell^{(i)}}{\partial \omega} \right] = \frac{1}{s} \text{var} \left[ \frac{\partial \ell^{(i)}}{\partial \omega} \right]$$

The mini-batch size $s$ is a hyperparameter that needs to be set.

The main advantages of mini-batch gradient descent:

  • It is faster than batch gradient descent, since each update goes through far fewer examples than batch gradient descent (all examples).
  • Since we randomly choose the mini-batch examples, we can avoid redundant examples and very similar examples that do not contribute much to the learning.
  • With batch size $<$ size of training set, it adds noise to the learning process that helps improve the generalization error.

The main disadvantages of mini-batch gradient descent:

  • It does not converge exactly. On each iteration, the learning step may go back and forth due to the noise, so it wanders around the minimum region but never settles at the minimum.
  • Due to the noise, the learning steps oscillate more than in batch gradient descent.



Note that if the batch size is equal to the number of training examples, the mini-batch gradient descent method is the same as batch gradient descent.
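A mini-batch gradient descent sketch with batch size $s < m$ (the toy data and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

m, s = 200, 20                     # training set size and mini-batch size
X = rng.normal(size=(m, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
alpha = 0.5
for _ in range(500):
    idx = rng.choice(m, size=s, replace=False)   # random mini-batch
    h = sigmoid(X[idx] @ w)
    grad = X[idx].T @ (h - y[idx]) / s           # average over s < m examples
    w = w - alpha * grad

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
print(acc)
```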

Summary



- There is no guarantee that what is shown below always happens, but the noisy SGD gradients can sometimes help escape local optima.

3.4. Advanced Optimizers from SGD
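One common refinement of SGD is momentum, which the Adam optimizer used later in this notebook also builds on. A minimal sketch of the momentum update (the quadratic objective and coefficients are illustrative assumptions):

```python
# Gradient descent with momentum on f(x) = (x - 3)^2 (gradient: 2(x - 3)).
# A velocity term accumulates past gradients, smoothing noisy update steps.
alpha = 0.1   # learning rate
beta = 0.9    # momentum coefficient
x, v = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (x - 3.0)
    v = beta * v - alpha * grad   # velocity: decayed history plus new gradient
    x = x + v
print(x)  # approaches the minimizer x = 3
```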

4. Regularization and NN Techniques

4.1. Data Augmentation

  • Big Data
  • Data augmentation
    • The simplest way to reduce overfitting is to increase the size of the training data
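For 1-D regression data like that used later in this notebook, one simple augmentation is to jitter each input with small random noise (an illustrative sketch; the noise scale and data values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 training pairs, mirroring the small regression data set used below
data_x = np.linspace(-4.5, 4.5, 10).reshape(-1, 1)
data_y = rng.normal(size=(10, 1))

# jitter the inputs with small Gaussian noise to create extra copies
n_copies = 5
aug_x = np.vstack([data_x + rng.normal(scale=0.05, size=data_x.shape)
                   for _ in range(n_copies)])
aug_y = np.vstack([data_y] * n_copies)
print(aug_x.shape, aug_y.shape)  # five times more training data
```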




4.2. Early stopping

  • Early stopping
    • When we see that the performance on the validation set is getting worse, we immediately stop the training on the model
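The rule above can be sketched as a loop guard that stops once the validation loss has failed to improve for a few consecutive checks (the `patience` parameter and loss values are illustrative assumptions):

```python
def early_stopping_epoch(val_losses, patience=2):
    # stop once validation loss fails to improve `patience` times in a row
    best = float("inf")
    bad_checks = 0
    for epoch, v in enumerate(val_losses):
        if v < best:
            best, bad_checks = v, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return epoch   # training would stop here
    return len(val_losses) - 1

# validation loss improves, then starts getting worse -> stop at epoch 4
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 1.1]))
```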




4.3. Dropout

  • This is one of the most interesting regularization techniques.
  • It also produces very good results and is consequently one of the most frequently used regularization techniques in deep learning.
  • At every iteration, it randomly selects some nodes and removes them.
  • It can also be thought of as an ensemble technique in machine learning.




  • tf.nn.dropout(layer, rate = p)
  • For training
    • rate: the probability that each element is dropped. For example, setting rate = 0.1 drops each input element with probability 0.1, i.e., 10% of the elements in expectation. Elements that are kept are scaled up by $\frac{1}{1-\text{rate}}$, and dropped elements are set to 0, so that the expected sum is unchanged.
  • For testing
    • All the elements are kept
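The drop-and-rescale behavior described above can be mimicked in NumPy (a sketch of inverted dropout, not TensorFlow's actual implementation):

```python
import numpy as np

def dropout(layer, rate, rng):
    # each element is dropped with probability `rate`; survivors are scaled
    # by 1/(1 - rate) so that the expected sum is unchanged
    keep = rng.random(layer.shape) >= rate
    return np.where(keep, layer / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
out = dropout(x, rate=0.1, rng=rng)
print(out.mean())  # close to 1.0 despite ~10% of elements being zeroed
```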

4.4. Batch Normalization

Batch normalization is a technique for improving the performance and stability of artificial neural networks.

It is used to normalize the input layer by adjusting and scaling the activations.




  • During training batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • During test, it simply shifts and rescales according to the empirical moments estimated during training.
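The train/test distinction can be sketched in NumPy (an illustrative sketch, not TensorFlow's implementation; `gamma` and `beta` denote the learnable scale and shift):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # training: shift and rescale using the mean/variance of the current batch
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def batch_norm_test(x, gamma, beta, moving_mu, moving_var, eps=1e-5):
    # test: shift and rescale using moments estimated during training
    x_hat = (x - moving_mu) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))  # un-normalized activations
out, mu, var = batch_norm_train(batch, gamma=1.0, beta=0.0)
print(out.mean(axis=0), out.std(axis=0))  # per-feature mean ~0, std ~1
```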

5. Tensorflow: DL Framework

5.1. DL Frameworks

Tensorflow

Keras

PyTorch

5.2. Tensorflow

  • TensorFlow is an open-source software library for deep learning.

It is a framework that performs computations very efficiently, and it can tap into the GPU (Graphics Processing Unit) to speed them up even further. This can have a huge effect, as we shall see shortly. TensorFlow can be controlled through a simple Python API.

TensorFlow is one of the most widely used libraries for implementing machine learning and other algorithms involving a large number of mathematical operations. TensorFlow was developed by Google, and it is one of the most popular machine learning libraries on GitHub. Google uses TensorFlow in almost all of its machine learning applications.

Tensor

TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. A vector is a 1-d array and is known as a 1st-order tensor. A matrix is a 2-d array and a 2nd-order tensor. The "flow" part of the name refers to computation flowing through a graph. Training and inference in a neural network, for example, involves the propagation of matrix computations through many nodes in a computational graph.


To run any operation defined in the graph, we need to create a session for that graph. The session also allocates memory to store the current values of the variables.

When you think of doing things in TensorFlow, you might want to think of creating tensors (like matrices), adding operations (that output other tensors), and then executing the computation (running the computational graph). In particular, it's important to realize that when you add an operation on tensors, it doesn't execute immediately. Rather, TensorFlow waits for you to define all the operations you want to perform. Then, TensorFlow optimizes the computation graph, deciding how to execute the computation, before generating the data. Because of this, a tensor in TensorFlow isn't so much holding the data as a placeholder for holding the data, waiting for the data to arrive when a computation is executed.


5.3. Computational Graph

  • tf.constant
  • tf.Variable
  • tf.placeholder

tf.constant

tf.constant creates a constant tensor specified by value, dtype, shape and so on.

In [1]:
import tensorflow as tf

a = tf.constant([1,2,3])
b = tf.constant(4, shape=[1,3])

A = a + b
B = a*b

These lines of code produce abstract tensors in the computation graph. However, contrary to what you might expect, the results do not actually get calculated yet; the lines only define the model, and no process has run to compute them.

In [2]:
A
Out[2]:
<tf.Tensor 'add:0' shape=(1, 3) dtype=int32>
In [3]:
B
Out[3]:
<tf.Tensor 'mul:0' shape=(1, 3) dtype=int32>
In [4]:
sess = tf.Session()
sess.run(A)
Out[4]:
array([[5, 6, 7]], dtype=int32)
In [5]:
sess.run(B)
Out[5]:
array([[ 4,  8, 12]], dtype=int32)

You can also use the following lines of code to start a Session with a context manager, run the result, and close the Session automatically after printing the output:

In [6]:
a = tf.constant([1,2,3])
b = tf.constant([4,5,6])

result = tf.multiply(a, b)

with tf.Session() as sess:
    output = sess.run(result)
    print(output)
[ 4 10 18]

tf.Variable

tf.Variable is regarded as the decision variable in optimization. Variables must be explicitly initialized before they can be used.

In [7]:
x1 = tf.Variable([1, 1], dtype = tf.float32)
x2 = tf.Variable([2, 2], dtype = tf.float32)
y = x1 + x2

print(y)
Tensor("add_1:0", shape=(2,), dtype=float32)
In [8]:
sess = tf.Session()

init = tf.global_variables_initializer()
sess.run(init)

sess.run(y)
Out[8]:
array([3., 3.], dtype=float32)

tf.placeholder

The value of tf.placeholder must be fed using the feed_dict optional argument to Session.run().

In [9]:
sess = tf.Session()
x = tf.placeholder(tf.float32, shape = [2,2])

sess.run(x, feed_dict = {x : [[1,2],[3,4]]})
Out[9]:
array([[1., 2.],
       [3., 4.]], dtype=float32)
In [10]:
a = tf.placeholder(tf.float32, shape = [2])
b = tf.placeholder(tf.float32, shape = [2])

sum = a + b

sess.run(sum, feed_dict = {a : [1,2], b : [3,4]})
Out[10]:
array([4., 6.], dtype=float32)

6. ANN Implementation with Dropout and BN

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Overfitting in Regression

In [12]:
N = 10
data_x = np.linspace(-4.5, 4.5, N)
data_y = np.array([0.9819, 0.7973, 1.9737, 0.1838, 1.3180, -0.8361, -0.6591, -2.4701, -2.8122, -6.2512])

data_x = data_x.reshape(-1,1)
data_y = data_y.reshape(-1,1)

plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.grid(alpha = 0.3)
plt.show()
In [13]:
n_input = 1
n_hidden1 = 30
n_hidden2 = 100
n_hidden3 = 100
n_hidden4 = 30
n_output = 1
In [14]:
weights = {
    'hidden1' : tf.Variable(tf.random_normal([n_input, n_hidden1], stddev = 0.1)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden1, n_hidden2], stddev = 0.1)),
    'hidden3' : tf.Variable(tf.random_normal([n_hidden2, n_hidden3], stddev = 0.1)),
    'hidden4' : tf.Variable(tf.random_normal([n_hidden3, n_hidden4], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_hidden4, n_output], stddev = 0.1)),
}

biases = {
    'hidden1' : tf.Variable(tf.random_normal([n_hidden1], stddev = 0.1)),
    'hidden2' : tf.Variable(tf.random_normal([n_hidden2], stddev = 0.1)),
    'hidden3' : tf.Variable(tf.random_normal([n_hidden3], stddev = 0.1)),
    'hidden4' : tf.Variable(tf.random_normal([n_hidden4], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1)),
}
In [15]:
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [16]:
def build_model(x, weights, biases):
    hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])
    hidden1 = tf.nn.sigmoid(hidden1)
    
    hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
    hidden2 = tf.nn.sigmoid(hidden2)
 
    hidden3 = tf.add(tf.matmul(hidden2, weights['hidden3']), biases['hidden3'])
    hidden3 = tf.nn.sigmoid(hidden3)
    
    hidden4 = tf.add(tf.matmul(hidden3, weights['hidden4']), biases['hidden4'])
    hidden4 = tf.nn.sigmoid(hidden4)
    
    output = tf.add(tf.matmul(hidden4, weights['output']), biases['output'])
    return output
In [17]:
pred = build_model(x, weights, biases)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)

LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
In [18]:
n_batch = 50    
n_iter = 10000 
n_prt = 1000    

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

loss_record = []

for epoch in range(n_iter):
    idx = np.random.randint(N, size = n_batch)
    train_x = data_x[idx,:]
    train_y = data_y[idx,:]
    
    sess.run(optm, feed_dict = {x: train_x,  y: train_y})
    
    if epoch % n_prt == 0:
        c = sess.run(loss, feed_dict = {x: train_x, y: train_y})
        loss_record.append(c)
In [19]:
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show() 
In [20]:
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp})

plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()

Dropout Implementation

In [21]:
p = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [22]:
def build_model(x, weights, biases, p):    
    hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])    
    hidden1 = tf.nn.sigmoid(hidden1)    
    dropout1 = tf.nn.dropout(hidden1, rate = p)
    
    hidden2 = tf.add(tf.matmul(dropout1, weights['hidden2']), biases['hidden2'])
    hidden2 = tf.nn.sigmoid(hidden2)    
    dropout2 = tf.nn.dropout(hidden2, rate = p)
    
    hidden3 = tf.add(tf.matmul(dropout2, weights['hidden3']), biases['hidden3'])
    hidden3 = tf.nn.sigmoid(hidden3)    
    dropout3 = tf.nn.dropout(hidden3, rate = p)
    
    hidden4 = tf.add(tf.matmul(dropout3, weights['hidden4']), biases['hidden4'])
    hidden4 = tf.nn.sigmoid(hidden4)    
    dropout4 = tf.nn.dropout(hidden4, rate = p)
    
    output = tf.add(tf.matmul(dropout4, weights['output']), biases['output'])
    return output
In [23]:
pred = build_model(x, weights, biases, p)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)

LR = 0.001
optm = tf.train.AdamOptimizer(LR).minimize(loss)
In [24]:
n_batch = 50 
n_iter = 10000
n_prt = 1000  

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

loss_record = []
for epoch in range(n_iter):
    idx = np.random.randint(N, size = n_batch)
    train_x = data_x[idx,:]
    train_y = data_y[idx,:]
    
    sess.run(optm, feed_dict = {x: train_x,  y: train_y, p: 0.2})
    
    if epoch % n_prt == 0:
        c = sess.run(loss, feed_dict = {x: train_x, y: train_y, p: 0.2})
        loss_record.append(c)
        #print ("Iter : {}".format(epoch))
        #print ("Train Cost : {}".format(c))        
In [25]:
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show()        
In [26]:
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp, p: 0})

plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()

Batch Normalization Implementation

In [27]:
is_training = tf.placeholder(tf.bool)
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [28]:
def build_model(x, weights, biases, is_training):    
    hidden1 = tf.add(tf.matmul(x, weights['hidden1']), biases['hidden1'])    
    hidden1 = tf.layers.batch_normalization(hidden1, training = is_training)
    hidden1 = tf.nn.sigmoid(hidden1)    
    
    hidden2 = tf.add(tf.matmul(hidden1, weights['hidden2']), biases['hidden2'])
    hidden2 = tf.layers.batch_normalization(hidden2, training = is_training)
    hidden2 = tf.nn.sigmoid(hidden2)    
    
    hidden3 = tf.add(tf.matmul(hidden2, weights['hidden3']), biases['hidden3'])
    hidden3 = tf.layers.batch_normalization(hidden3, training = is_training)
    hidden3 = tf.nn.sigmoid(hidden3)    
    
    hidden4 = tf.add(tf.matmul(hidden3, weights['hidden4']), biases['hidden4'])
    hidden4 = tf.layers.batch_normalization(hidden4, training = is_training)
    hidden4 = tf.nn.sigmoid(hidden4)
    
    output = tf.add(tf.matmul(hidden4, weights['output']), biases['output'])
    return output
In [29]:
pred = build_model(x, weights, biases, is_training)
loss = tf.square(pred - y)
loss = tf.reduce_mean(loss)

LR = 0.001
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optm = tf.train.AdamOptimizer(LR).minimize(loss)
WARNING: Logging before flag parsing goes to stderr.
W0109 22:00:42.440064 140241494570816 deprecation.py:323] From <ipython-input-28-bf173161ddb7>:3: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
In [30]:
tf.get_default_graph().get_all_collection_keys()
Out[30]:
['trainable_variables', 'variables', 'train_op', 'cond_context', 'update_ops']
In [31]:
tf.get_collection('trainable_variables')
Out[31]:
[<tf.Variable 'Variable:0' shape=(2,) dtype=float32_ref>,
 <tf.Variable 'Variable_1:0' shape=(2,) dtype=float32_ref>,
 <tf.Variable 'Variable_2:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_3:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_5:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_6:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_7:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_8:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_10:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_11:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/gamma:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/beta:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/gamma:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/beta:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/gamma:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/beta:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/gamma:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/beta:0' shape=(30,) dtype=float32_ref>]
In [32]:
tf.get_collection('variables')
Out[32]:
[<tf.Variable 'Variable:0' shape=(2,) dtype=float32_ref>,
 <tf.Variable 'Variable_1:0' shape=(2,) dtype=float32_ref>,
 <tf.Variable 'Variable_2:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_3:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_5:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_6:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_7:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_8:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_10:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_11:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'beta1_power:0' shape=() dtype=float32_ref>,
 <tf.Variable 'beta2_power:0' shape=() dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam_1:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam_1:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam_1:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam_1:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam_1:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam_1:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam_1:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam_1:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'beta1_power_1:0' shape=() dtype=float32_ref>,
 <tf.Variable 'beta2_power_1:0' shape=() dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam_2:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam_3:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam_2:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam_3:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam_2:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam_3:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam_2:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam_3:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam_2:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam_3:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam_2:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam_3:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam_2:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam_3:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam_2:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam_3:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam_2:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam_3:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam_2:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam_3:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/gamma:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/beta:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/moving_mean:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/moving_variance:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/gamma:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/beta:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/moving_mean:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/moving_variance:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/gamma:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/beta:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/moving_mean:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/moving_variance:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/gamma:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/beta:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/moving_mean:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/moving_variance:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'beta1_power_2:0' shape=() dtype=float32_ref>,
 <tf.Variable 'beta2_power_2:0' shape=() dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam_4:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_2/Adam_5:0' shape=(1, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam_4:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_3/Adam_5:0' shape=(30, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam_4:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_4/Adam_5:0' shape=(100, 100) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam_4:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_5/Adam_5:0' shape=(100, 30) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam_4:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_6/Adam_5:0' shape=(30, 1) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam_4:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_7/Adam_5:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam_4:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_8/Adam_5:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam_4:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_9/Adam_5:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam_4:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_10/Adam_5:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam_4:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'Variable_11/Adam_5:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/gamma/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/gamma/Adam_1:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/beta/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization/beta/Adam_1:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/gamma/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/gamma/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/beta/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_1/beta/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/gamma/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/gamma/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/beta/Adam:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_2/beta/Adam_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/gamma/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/gamma/Adam_1:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/beta/Adam:0' shape=(30,) dtype=float32_ref>,
 <tf.Variable 'batch_normalization_3/beta/Adam_1:0' shape=(30,) dtype=float32_ref>]
In [33]:
tf.get_collection('update_ops')
Out[33]:
[<tf.Operation 'batch_normalization/cond_2/Merge' type=Merge>,
 <tf.Operation 'batch_normalization/cond_3/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_1/cond_2/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_1/cond_3/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_2/cond_2/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_2/cond_3/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_3/cond_2/Merge' type=Merge>,
 <tf.Operation 'batch_normalization_3/cond_3/Merge' type=Merge>]
In [34]:
n_batch = 50 
n_iter = 10000
n_prt = 1000  

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

loss_record = []
for epoch in range(n_iter):
    idx = np.random.randint(N, size = n_batch)
    train_x = data_x[idx,:]
    train_y = data_y[idx,:]
    
    sess.run(optm, feed_dict = {x: train_x,  y: train_y, is_training: True})
    
    if epoch % n_prt == 0:
        c = sess.run(loss, feed_dict = {x: train_x, y: train_y, is_training: True})
        loss_record.append(c)
In [35]:
plt.figure(figsize = (10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record, label = 'training')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.grid(alpha = 0.3)
plt.legend(fontsize = 12)
plt.ylim([0, 10])
plt.show()
In [36]:
xp = np.linspace(-4.5, 4.5, 100).reshape(-1,1)
my_pred = sess.run(pred, feed_dict = {x: xp, is_training: False})

plt.figure(figsize = (10,8))
plt.plot(data_x, data_y, 'o')
plt.plot(xp, my_pred, 'r')
plt.grid(alpha = 0.3)
plt.show()